The real functional analysis is used a lot in the ML. There is also the case where the complex analysis is used. We are not talking about the very detail of the complexy analysis, we just mention its application in the era of ML.
Basics or formulas required.
In this section, we just mention some critical formulas. In the case of complex number, we have
\[ e^{i \theta}=\cos \theta+i \sin \theta \]
The equation can be proved if we use the Taylor series of \(e^x\), \(\operatorname{cos} x\) and \(\operatorname{sin} x\) to prove it. This formula will be highlighted when we use complex analysis in the ML.
Additionally, we note that the complex number can be viewed as a special dimension so that we have \(i^2=-1\). This \(i\) will be helpful for many special computation.
Consider the rotary embedding using complex analysis
The token positional embedding is used to capture the features of token because of its position in the sequence. To put it simple, the token in the position 0 has different contribution from the token to the position 10.
The present research indicate that we wish to add the absolute embedding and the relative embedding to the token based on its position. The absolute embedding is just decided by the position of the token. And the relative embedding is decided by the relative position of the token. Note that for the embedding computation, we need to compute the attendence score between the token at the position of \(m\) and \(n\):
\[ \begin{aligned} \boldsymbol{q}_m & =f_q\left(\boldsymbol{x}_m, m\right) \\ \boldsymbol{k}_n & =f_k\left(\boldsymbol{x}_n, n\right) \\ \boldsymbol{v}_n & =f_v\left(\boldsymbol{x}_n, n\right). \end{aligned} \]
When you consider this problem, try to simulate the LLM inference (not training). The query is the new token to be predicted, and the key and value are the existing tokens. Now, we also need to consider the position info between them. The original transformer paper uses the absolute position embedding:
\[ f_{t: t \in\{q, k, v\}}\left(\boldsymbol{x}_i, i\right):=\boldsymbol{W}_{t: t \in\{q, k, v\}}\left(\boldsymbol{x}_i+\boldsymbol{p}_i\right), \]
and
\[ \begin{cases}\boldsymbol{p}_{i, 2 t} & =\sin \left(k / 10000^{2 t / d}\right) \\ \boldsymbol{p}_{i, 2 t+1} & =\cos \left(k / 10000^{2 t / d}\right)\end{cases} \]
If we think about this structure further, we found that the sin
and cos
function is the periodic functions, which means for the same relative distance, we could observe similar embedding.
Another relative positional embedding is to note that the relative position between the token \(m\) and \(n\) is \(m-n\), and the embedding is dependent on the \(m-n\), the difference. Note that we need to use \(\boldsymbol{q}_m^T\boldsymbol{k}_n\), and this should be able to reflect the relative position information between the two tokens. And the current research indicates that the relative position embedding is important for the positional information. We wish the inner product encodes position information by:
\[ \left\langle f_q\left(\boldsymbol{x}_m, m\right), f_k\left(\boldsymbol{x}_n, n\right)\right\rangle=g\left(\boldsymbol{x}_m, \boldsymbol{x}_n, m-n\right) . \]
The idea here is that we express the relative position as the information of angle rather than the position in a linear segment. And we can use complex analysis for it. It is like the signal processing where each signal has the frequency and the magnitude. Suppose \(d=2\), and we can assume the embedding information as:
\[ \begin{aligned} f_q\left(\boldsymbol{x}_q, m\right) & =R_q\left(\boldsymbol{x}_q, m\right) e^{i \Theta_q\left(\boldsymbol{x}_q, m\right)} \\ f_k\left(\boldsymbol{x}_k, n\right) & =R_k\left(\boldsymbol{x}_k, n\right) e^{i \Theta_k\left(\boldsymbol{x}_k, n\right)} \\ g\left(\boldsymbol{x}_q, \boldsymbol{x}_k, n-m\right) & =R_g\left(\boldsymbol{x}_q, \boldsymbol{x}_k, n-m\right) e^{i \Theta_g\left(\boldsymbol{x}_q, \boldsymbol{x}_k, n-m\right)} \end{aligned} \]
Thus, using this information, we have
\[ \begin{aligned} R_q\left(\boldsymbol{x}_q, m\right) R_k\left(\boldsymbol{x}_k, n\right) & =R_g\left(\boldsymbol{x}_q, \boldsymbol{x}_k, n-m\right), \\ \Theta_k\left(\boldsymbol{x}_k, n\right)-\Theta_q\left(\boldsymbol{x}_q, m\right) & =\Theta_g\left(\boldsymbol{x}_q, \boldsymbol{x}_k, n-m\right), \end{aligned} \]
After derivation, we found that if we choose the following expression, we can satisfy the condition above:
\[ \begin{aligned} f_q\left(\boldsymbol{x}_m, m\right) & =\left(\boldsymbol{W}_q \boldsymbol{x}_m\right) e^{i m \theta} \\ f_k\left(\boldsymbol{x}_n, n\right) & =\left(\boldsymbol{W}_k \boldsymbol{x}_n\right) e^{i n \theta} \\ g\left(\boldsymbol{x}_m, \boldsymbol{x}_n, m-n\right) & =\operatorname{Re}\left[\left(\boldsymbol{W}_q \boldsymbol{x}_m\right)\left(\boldsymbol{W}_k \boldsymbol{x}_n\right)^* e^{i(m-n) \theta}\right] \end{aligned} \]
The derivation is as the following (This is not shown in the paper, you can use the derivation below to understand the paper better):
Note that if we express two vectors as \(z_1 = a + bi\) and \(z_2 = c + di\), the inner product is \(ac + bd\). How would this be related to the multiplication of the two vectors? We actually have: \(ac+bd = \operatorname{Re}(z_1 * \overline{z_2})\).
\[ \begin{aligned} \langle f_q, f_k\rangle &= \operatorname{Re}(f_q * \overline{f_k}) \\ &= \left(\boldsymbol{W}_q \boldsymbol{x}_m\right) e^{i m \theta} \overline{\left(\boldsymbol{W}_k \boldsymbol{x}_n\right) e^{i n \theta}} \\ &= \left(\boldsymbol{W}_q \boldsymbol{x}_m\right) \overline{\left(\boldsymbol{W}_k \boldsymbol{x}_n\right)} e^{i(n-m) \theta} \end{aligned} \]
From the expression of \(f_q\left(\boldsymbol{x}_m, m\right) =\left(\boldsymbol{W}_q \boldsymbol{x}_m\right) e^{i m \theta}\), we can design the rotary embedding setup in the llama2. It is important to note here that we introduce the complex number since we wish to integrate the meaning of the magnitude and the angle. The embedding itself represents the magnitude, and the angle is from the position. In real matrix multiplication, we can only do the real calculation, thus, we need the mapping above.
We can put it another way. The real and imaginary part of the complex number are useful information for us. We can express the vectors using complex theory, so that we can incorporate the angle or phase information from the vectors. Finally, we still need to map back to the real operations and proceed it with useful information. It is like a auxiliary method to help process information using some intermediate state.